


Case Study 2

Employee Attrition and Salary Analysis and Prediction Models

Import Libraries

Importing the libraries used throughout this project.
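
The library() calls themselves are not shown in this knitted output. Below is a sketch of the packages the rest of the code appears to depend on, inferred from the functions used later (ggplot2/dplyr, GGally::ggpairs, class::knn, caret::confusionMatrix, e1071::naiveBayes, imbalance::oversample, fastDummies::dummy_cols, readxl::read_excel, scales); treat it as an assumed reconstruction rather than the original chunk.

### Libraries (assumed reconstruction based on the functions used below)
library(tidyverse)   # dplyr pipes and ggplot2 plotting
library(GGally)      # ggpairs() pair plots
library(class)       # knn() classifier
library(caret)       # confusionMatrix()
library(e1071)       # naiveBayes()
library(imbalance)   # oversample() with ADASYN
library(fastDummies) # dummy_cols() for one-hot encoding
library(readxl)      # read_excel() for the no-salary file
library(scales)      # percent axis labels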

Read in full employee dataset

### Data Imports
employeeData = read.csv("CaseStudy2-data.csv", sep = ",")
employee_original = employeeData # make copy of original dataset
head(employeeData)
##   ID Age Attrition    BusinessTravel DailyRate             Department
## 1  1  32        No     Travel_Rarely       117                  Sales
## 2  2  40        No     Travel_Rarely      1308 Research & Development
## 3  3  35        No Travel_Frequently       200 Research & Development
## 4  4  32        No     Travel_Rarely       801                  Sales
## 5  5  24        No Travel_Frequently       567 Research & Development
## 6  6  27        No Travel_Frequently       294 Research & Development
##   DistanceFromHome Education   EducationField EmployeeCount EmployeeNumber
## 1               13         4    Life Sciences             1            859
## 2               14         3          Medical             1           1128
## 3               18         2    Life Sciences             1           1412
## 4                1         4        Marketing             1           2016
## 5                2         1 Technical Degree             1           1646
## 6               10         2    Life Sciences             1            733
##   EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1                       2   Male         73              3        2
## 2                       3   Male         44              2        5
## 3                       3   Male         60              3        3
## 4                       3 Female         48              3        3
## 5                       1 Female         32              3        1
## 6                       4   Male         32              3        3
##                  JobRole JobSatisfaction MaritalStatus MonthlyIncome
## 1        Sales Executive               4      Divorced          4403
## 2      Research Director               3        Single         19626
## 3 Manufacturing Director               4        Single          9362
## 4        Sales Executive               4       Married         10422
## 5     Research Scientist               4        Single          3760
## 6 Manufacturing Director               1      Divorced          8793
##   MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike
## 1        9250                  2      Y       No                11
## 2       17544                  1      Y       No                14
## 3       19944                  2      Y       No                11
## 4       24032                  1      Y       No                19
## 5       17218                  1      Y      Yes                13
## 6        4809                  1      Y       No                21
##   PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel
## 1                 3                        3            80                1
## 2                 3                        1            80                0
## 3                 3                        3            80                0
## 4                 3                        3            80                2
## 5                 3                        3            80                0
## 6                 4                        3            80                2
##   TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## 1                 8                     3               2              5
## 2                21                     2               4             20
## 3                10                     2               3              2
## 4                14                     3               3             14
## 5                 6                     2               3              6
## 6                 9                     4               2              9
##   YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## 1                  2                       0                    3
## 2                  7                       4                    9
## 3                  2                       2                    2
## 4                 10                       5                    7
## 5                  3                       1                    3
## 6                  7                       1                    7

Cleaning Dataset

In this section, I check for missing values and encode the character columns as numeric variables. There is nothing to clean: the dataset contains no nulls and the data is of high integrity.

### Check for Missing Values
sum(is.na(employeeData))
## [1] 0
# Make overtime & Attrition column binary 
employeeData$cleanOverTime = ifelse(employeeData$OverTime=="Yes",1,0)
employeeData$cleanAttrition = ifelse(employeeData$Attrition=="Yes",1,0)

Attrition Exploratory Data Analysis (EDA)

First, we start off by looking at the attrition count. From our bar chart, we find that "No" attrition outnumbers "Yes" by nearly 7 to 1.

Some variables that stand out as worth including in the models are Monthly Income, Overtime, Job Level, and Job Role.

Our Attrition vs. Monthly Income histogram shows that as an individual's salary increases, they become less likely to quit. Next, we find that those who work overtime are more likely to quit: a little over 25% of employees who work overtime leave, which makes sense since they are more prone to stress. Job Role also appears to be a big factor; almost 50% of sales representatives quit, as do about 25% of those working in human resources.

Furthermore, from our pair plots, we find that the length of time someone has worked is somewhat correlated with attrition as well. The correlations hover between 0.59 and 0.78, a positive and reasonably strong relationship; not very strong, but enough to make an impact.
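
The percentages quoted above can be checked directly with a few quick summaries. The sketch below is illustrative and was not part of the original analysis; it assumes the employeeData frame from the previous chunk.

### Quick attrition-rate checks (illustrative sketch, not in the original analysis)
# Overall "No" vs. "Yes" balance behind the roughly 7-to-1 claim
employeeData %>% count(Attrition) %>% mutate(prop = n / sum(n))

# Attrition rate for overtime vs. non-overtime employees
employeeData %>% 
  group_by(OverTime) %>% 
  summarise(attritionRate = mean(Attrition == "Yes"))

# Attrition rate by job role, highest first
employeeData %>% 
  group_by(JobRole) %>% 
  summarise(attritionRate = mean(Attrition == "Yes")) %>% 
  arrange(desc(attritionRate))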

#### Exploratory Data Analysis
#### Attrition EDA
employeeData %>% ggplot(aes(x=Attrition,fill=Attrition)) + 
  geom_bar()+
  ggtitle("Attrition Count") +
  xlab("Attrition")+ylab("Count")

### Percentage Comparison by Job Role
ggplot(employeeData, aes(x = JobRole, fill = Attrition)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent)+ 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ggtitle("Attrition Rate By Job Role")

### Attrition Vs. Job Satisfaction
employeeData %>% 
  ggplot(aes(x=JobSatisfaction,fill=Attrition))+
  geom_bar()+
  ggtitle("Attrition Vs. Job Satisfaction") 

### Attrition Vs. MonthlyIncome
employeeData %>% ggplot(aes(x=MonthlyIncome,fill=Attrition))+
  geom_histogram()+
  ggtitle("Attrition Vs. MonthlyIncome") 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

### Attrition Vs. DistanceFromHome
employeeData %>% 
  ggplot(aes(x=DistanceFromHome,fill=Attrition))+
  geom_histogram()+
  ggtitle("Attrition Vs. DistanceFromHome") 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

### Attrition Vs. MonthlyRate
employeeData %>% ggplot(aes(x=MonthlyRate,fill=Attrition))+geom_histogram()+ggtitle("Attrition Vs. MonthlyRate") 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

### Attrition Vs. OverTime
employeeData %>% 
  ggplot(aes(x=OverTime,fill=Attrition))+
  geom_bar(position="fill")+ggtitle("Attrition Vs. Overtime")+
  scale_y_continuous(labels = scales::percent)

### Attrition Vs. Years Since Last Promotion
employeeData %>% 
  ggplot(aes(x=YearsSinceLastPromotion,fill=Attrition))+
  geom_bar(position="fill")+ggtitle("Attrition Vs. Years Since Last Promotion") +  
  scale_y_continuous(labels = scales::percent)

### Attrition Vs. Salary Hike - NO
employeeData %>% 
  ggplot(aes(x=PercentSalaryHike,fill=Attrition))+
  geom_histogram(position="fill")+
  ggtitle("Attrition Vs. Percent Salary Hike") 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 30 rows containing missing values (geom_bar).

### Attrition Vs. Age
employeeData %>% 
  ggplot(aes(x=Age,fill=Attrition))+
  geom_histogram()+
  ggtitle("Attrition Vs. Age")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

### Attrition vs Marital Status
employeeData %>% 
  ggplot(aes(x=MaritalStatus,fill=Attrition))+
  geom_bar(position="fill")+
  ggtitle("Attrition Vs. Marital Status")

### Attrition Vs. PercentSalaryHike
employeeData %>% 
  ggplot(aes(x=PercentSalaryHike,fill=Attrition))+
  geom_histogram()+ggtitle("Attrition Vs. PercentSalaryHike")+
  scale_y_continuous(labels = scales::percent)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

### Attrition Vs. WorkLifeBalance
employeeData %>% 
  ggplot(aes(x=WorkLifeBalance,fill=Attrition))+
  geom_histogram()+ggtitle("Attrition Vs. WorkLifeBalance")+
  scale_y_continuous(labels = scales::percent)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

### Attrition Vs. StockOptionLevel
employeeData %>% 
  ggplot(aes(x=StockOptionLevel,fill=Attrition))+
  geom_histogram()+ggtitle("Attrition Vs. StockOptionLevel")+
  scale_y_continuous(labels = scales::percent)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

### Attrition Vs. TrainingTimesLastYear
employeeData %>% 
  ggplot(aes(x=TrainingTimesLastYear,fill=Attrition))+
  geom_histogram()+
  ggtitle("Attrition Vs. TrainingTimesLastYear")+
  scale_y_continuous(labels = scales::percent)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

### Attrition Vs. PerformanceRating
employeeData %>% 
  ggplot(aes(x=PerformanceRating,fill=Attrition))+
  geom_histogram()+ggtitle("Attrition Vs. PerformanceRating")+
  scale_y_continuous(labels = scales::percent)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

### Attrition Vs. RelationshipSatisfaction
employeeData %>% 
  ggplot(aes(x=RelationshipSatisfaction,fill=Attrition))+
  geom_histogram()+ggtitle("Attrition Vs. RelationshipSatisfaction")+
  scale_y_continuous(labels = scales::percent)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

### Attrition Vs. Joblevel
employeeData %>% 
  ggplot(aes(x=JobLevel,fill=Attrition))+
  geom_bar()+ggtitle("Attrition Vs. Joblevel")+
  scale_y_continuous(labels = scales::percent)

### Attrition Vs. BusinessTravel
employeeData %>% 
  ggplot(aes(x=BusinessTravel,fill=Attrition))+
  geom_bar()+ggtitle("Attrition Vs. BusinessTravel")+
  scale_y_continuous(labels = scales::percent)

#### Monthly Income EDA
### Compare Incomes
employeeData %>% group_by(Attrition) %>% summarise(compareincomes=mean(MonthlyIncome))
## # A tibble: 2 × 2
##   Attrition compareincomes
##   <chr>              <dbl>
## 1 No                 6702 
## 2 Yes                4765.
### Job Role vs. Monthly Salary
employeeData %>% 
  ggplot(aes(x=JobRole,y = MonthlyIncome, fill = JobRole))+
  geom_boxplot()+ggtitle("Monthly Income vs. Job Role")+
  theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust=1)) 

ggpairs(employeeData[,c(3,30, 33:36)], aes(color = Attrition))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

employeeData %>% 
  ggplot(aes(x=TotalWorkingYears,y = MonthlyIncome))+
  geom_point()+ggtitle("Monthly Income vs. TotalWorkingYears")+
  theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust=1)) 

employeeData %>% 
  ggplot(aes(x=BusinessTravel,y = MonthlyIncome))+
  geom_boxplot()+ggtitle("Monthly Income vs. BusinessTravel")+
  theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust=1))

KNN Model

### KNN Model

## Classify
employeeData2 = employeeData
employeeData2$Attrition = as.factor(employeeData2$Attrition)

# create dataset for KNN model
model = employeeData2[,-c(1,4,5,6,8,9,10,11,12,13,14,17,19,21,23,24,28,38)]

# oversample to make up for imbalance in dataset
model = oversample(model,classAttr = "Attrition",method = "ADASYN")

## ML
set.seed(124) # Changed seed multiple times to see how high it can go
iterations = 200
numks = 20
splitPerc = .70
masterAcc = matrix(nrow = iterations, ncol = numks)
for(j in 1:iterations)
{
  trainIndices = sample(1:dim(model)[1],round(splitPerc * dim(model)[1]))
  train = model[trainIndices,]
  test = model[-trainIndices,]
  for(i in 1:numks)
  {
    classifications = knn(train[,-c(2)],test[,-c(2)],train$Attrition, prob = TRUE, k = i)
    table(classifications,test$Attrition)
    CM = confusionMatrix(table(classifications,test$Attrition))
    masterAcc[j,i] = CM$overall[1]
  }
  
}
MeanAcc = colMeans(masterAcc)
plot(seq(1,numks,1),MeanAcc, type = "l")

which.max(MeanAcc)
## [1] 1
max(MeanAcc)
## [1] 0.7058681

Confusion Matrix for KNN model

As we build and analyze our KNN model, we find that KNN is not the best model to use. Its mean accuracy tops out at only about 0.71. Across the runs, the highest sensitivity we observed was just 0.6471, with specificity not much better at 0.7719. As a result, we will move on to a different model.
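
The sensitivity and specificity quoted here come from inspecting individual runs; the tuning loop above only records accuracy. Below is a hedged sketch of how these metrics could be pulled from caret's confusionMatrix (via its byClass slot) for each k on the last train/test split; the vector names sensByK and specByK are illustrative and not from the original code.

### Sketch: sensitivity/specificity by k on the last train/test split (illustrative)
sensByK = numeric(numks)
specByK = numeric(numks)
for(i in 1:numks)
{
  classifications = knn(train[,-c(2)],test[,-c(2)],train$Attrition, prob = TRUE, k = i)
  CM = confusionMatrix(table(classifications,test$Attrition))
  sensByK[i] = CM$byClass["Sensitivity"]
  specByK[i] = CM$byClass["Specificity"]
}
max(sensByK) # best sensitivity observed across k
max(specByK) # best specificity observed across k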

classifications = knn(train[,-c(2)],test[,-c(2)],train$Attrition, prob = TRUE, k = 3)
table(classifications,test$Attrition)
##                
## classifications  No Yes
##             No  138  52
##             Yes  79 163
CM = confusionMatrix(table(classifications,test$Attrition))
CM
## Confusion Matrix and Statistics
## 
##                
## classifications  No Yes
##             No  138  52
##             Yes  79 163
##                                          
##                Accuracy : 0.6968         
##                  95% CI : (0.651, 0.7398)
##     No Information Rate : 0.5023         
##     P-Value [Acc > NIR] : < 2e-16        
##                                          
##                   Kappa : 0.3939         
##                                          
##  Mcnemar's Test P-Value : 0.02311        
##                                          
##             Sensitivity : 0.6359         
##             Specificity : 0.7581         
##          Pos Pred Value : 0.7263         
##          Neg Pred Value : 0.6736         
##              Prevalence : 0.5023         
##          Detection Rate : 0.3194         
##    Detection Prevalence : 0.4398         
##       Balanced Accuracy : 0.6970         
##                                          
##        'Positive' Class : No             
## 

Naive-Bayes Model

Now we want to use Naive-Bayes to build our model. This model yielded much better results, with an accuracy of 0.8884, a specificity of 0.9019, and a sensitivity of 0.8750. As a result, we find this model to be quite accurate. Therefore, we will use the Naive-Bayes model we have built to predict attrition for the no-attrition dataset.

set.seed(25)
naive_data=employeeData

naive_data$Attrition = as.factor(naive_data$Attrition)

model2 = naive_data[,-c(1,2,4,5,7,9,10,11,13,
                        14,21:23,25,28,37,38)]

## get data ready for oversampling
#Convert all character variables to factors
model2 = model2 %>%
  mutate_if(sapply(model2, is.character), as.factor)
#Convert all character variables to integer 
model2 = model2 %>%
  mutate_if(sapply(model2, is.factor), as.integer)

#oversample
model2 = oversample(model2,classAttr = "Attrition",method = "ADASYN")

#revert integer columns back to factor
model2 = model2 %>%
  mutate_if(sapply(model2, is.integer), as.factor)

#change Attrition back to "No"/"Yes" labels and make it a factor
model2$Attrition <- ifelse(model2$Attrition==1,"No","Yes")
model2$Attrition <- as.factor(model2$Attrition)
trainIndices = sample(1:dim(model2)[1],round(.70 * dim(model2)[1]))
train = model2[trainIndices,]
test = model2[-trainIndices,]

classifier1 = naiveBayes(Attrition~., data =model2)

pred = predict(classifier1,newdata=test)
CM = confusionMatrix(table(test$Attrition,pred))

CM
## Confusion Matrix and Statistics
## 
##      pred
##        No Yes
##   No  189  21
##   Yes  27 193
##                                           
##                Accuracy : 0.8884          
##                  95% CI : (0.8547, 0.9165)
##     No Information Rate : 0.5023          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7768          
##                                           
##  Mcnemar's Test P-Value : 0.4705          
##                                           
##             Sensitivity : 0.8750          
##             Specificity : 0.9019          
##          Pos Pred Value : 0.9000          
##          Neg Pred Value : 0.8773          
##              Prevalence : 0.5023          
##          Detection Rate : 0.4395          
##    Detection Prevalence : 0.4884          
##       Balanced Accuracy : 0.8884          
##                                           
##        'Positive' Class : No              
## 

Read in NoAttrition dataset

employee_noattri = read.csv("CaseStudy2CompSet No Attrition (1).csv")
#employee_noattri

Predict Attrition

In this code block, we will use our Naive-Bayes model to predict attrition and store the predictions in a variable.

# Convert overtime to binary 
employee_noattri$cleanOverTime = ifelse(employee_noattri$OverTime=="Yes",1,0)

employee_noattri = employee_noattri %>%
  mutate_if(sapply(employee_noattri, is.character), as.factor)
#Convert all character variables to integer 
employee_noattri = employee_noattri %>%
  mutate_if(sapply(employee_noattri, is.factor), as.integer)
employee_noattri = employee_noattri %>%
  mutate_if(sapply(employee_noattri, is.integer), as.factor)


attrition_pred = predict(classifier1, employee_noattri)
attrition_pred
##   [1] No  No  Yes No  No  No  No  No  No  Yes No  Yes No  No  No  No  No  No 
##  [19] No  No  Yes No  No  No  No  No  Yes Yes Yes Yes No  No  No  No  Yes No 
##  [37] No  No  No  No  No  No  No  No  Yes No  No  Yes Yes No  No  Yes No  No 
##  [55] No  No  No  No  No  No  No  No  Yes No  No  No  No  No  No  No  Yes No 
##  [73] No  No  No  Yes No  No  No  No  No  No  Yes No  No  No  No  No  No  Yes
##  [91] No  No  No  Yes No  No  No  No  Yes Yes No  No  No  No  Yes No  No  No 
## [109] No  No  No  No  No  No  No  No  No  No  Yes No  No  No  No  No  No  No 
## [127] Yes No  No  No  No  No  No  Yes No  No  No  Yes No  Yes No  No  No  No 
## [145] No  No  Yes Yes No  No  No  No  No  No  No  Yes No  No  Yes No  Yes Yes
## [163] Yes No  No  No  No  No  No  No  Yes No  Yes Yes No  No  Yes Yes No  No 
## [181] Yes No  No  No  No  No  No  Yes No  No  No  No  No  No  Yes No  No  No 
## [199] No  No  No  No  No  No  No  No  No  No  No  No  Yes No  No  No  No  No 
## [217] Yes No  No  No  No  No  No  No  No  No  Yes No  No  No  Yes No  No  Yes
## [235] No  No  Yes No  No  No  No  No  No  No  No  No  No  No  No  Yes Yes No 
## [253] No  Yes No  No  No  No  No  No  No  No  No  No  No  No  No  No  Yes No 
## [271] No  Yes No  Yes Yes No  Yes No  Yes Yes No  No  No  No  Yes Yes No  Yes
## [289] No  Yes No  No  Yes No  Yes No  No  Yes No  No 
## Levels: No Yes

Add attrition prediction to dataset

employee_noattri$AttritionPred = attrition_pred
#employee_noattri

Write csv file for attrition prediction

filtered_noattrition = employee_noattri %>% 
  select(ID, AttritionPred) %>% arrange(ID)
head(filtered_noattrition)
##     ID AttritionPred
## 1 1171            No
## 2 1172            No
## 3 1173           Yes
## 4 1174            No
## 5 1175            No
## 6 1176            No
write.csv(filtered_noattrition, "Case2PredictionsChang Attrition.csv")

Read in NoSalary dataset

employee_nosalary = read_excel("CaseStudy2CompSet No Salary (2).xlsx")

Finding Variables for Best Linear Regression Model

We will create a linear regression model that includes all of the variables in order to decide which ones to select for our final linear regression model. To incorporate all of the variables, I will create dummy columns that turn the categorical columns into binary indicators so they can be used in this model.

employeeData3 = employeeData2
employeeData3 = dummy_cols(employeeData3, 
                           select_columns = c("BusinessTravel","Department",
                                              "EducationField","Gender",
                                              "JobRole", "MaritalStatus" ))

employeeData3$JobRole_Others = ifelse(employeeData3$JobRole_Manager == 1|employeeData3$`JobRole_Research Director` == 1|
                                        employeeData3$`JobRole_Sales Executive` == 1, 0, 1)
employeeData3$BusinessTravel_Others = ifelse(employeeData3$`BusinessTravel_Non-Travel` == 1, 0, 1)

employeeData3[,c(39:66)] = employeeData3[,c(39:66)] %>%
  mutate_if(sapply(employeeData3[,c(39:66)], is.numeric), as.factor)

lm_salarydf = employeeData3[,c(2,5,7,8,12,14,15,16,18, 20,
                              21,25,26,27,29,30, 31,32,33,34,
                              35,36,37, 39:64)]

lmsalary_model = lm(MonthlyIncome~.,
                    data = lm_salarydf)
summary(lmsalary_model)
## 
## Call:
## lm(formula = MonthlyIncome ~ ., data = lm_salarydf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3632.4  -659.6     2.9   620.1  4165.7 
## 
## Coefficients: (6 not defined because of singularities)
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           3.819e+02  5.470e+02   0.698  0.48531    
## Age                                  -1.510e+00  5.630e+00  -0.268  0.78867    
## DailyRate                             1.450e-01  9.122e-02   1.590  0.11222    
## DistanceFromHome                     -6.536e+00  4.531e+00  -1.442  0.14956    
## Education                            -3.259e+01  3.689e+01  -0.883  0.37735    
## EnvironmentSatisfaction              -6.510e+00  3.349e+01  -0.194  0.84592    
## HourlyRate                           -3.380e-01  1.824e+00  -0.185  0.85307    
## JobInvolvement                        1.065e+01  5.236e+01   0.203  0.83894    
## JobLevel                              2.785e+03  8.341e+01  33.394  < 2e-16 ***
## JobSatisfaction                       2.314e+01  3.295e+01   0.702  0.48276    
## MonthlyRate                          -9.379e-03  5.141e-03  -1.824  0.06844 .  
## PercentSalaryHike                     2.532e+01  1.582e+01   1.600  0.10989    
## PerformanceRating                    -3.237e+02  1.615e+02  -2.004  0.04537 *  
## RelationshipSatisfaction              1.514e+01  3.309e+01   0.457  0.64749    
## StockOptionLevel                      4.146e+00  5.682e+01   0.073  0.94185    
## TotalWorkingYears                     5.204e+01  1.043e+01   4.990 7.37e-07 ***
## TrainingTimesLastYear                 2.171e+01  2.904e+01   0.748  0.45497    
## WorkLifeBalance                      -3.909e+01  5.139e+01  -0.761  0.44706    
## YearsAtCompany                       -5.591e+00  1.321e+01  -0.423  0.67218    
## YearsInCurrentRole                    4.987e+00  1.699e+01   0.294  0.76916    
## YearsSinceLastPromotion               3.160e+01  1.522e+01   2.077  0.03813 *  
## YearsWithCurrManager                 -2.652e+01  1.666e+01  -1.591  0.11190    
## cleanOverTime                         2.343e+00  8.080e+01   0.029  0.97687    
## `BusinessTravel_Non-Travel`1         -3.787e+02  1.198e+02  -3.160  0.00163 ** 
## BusinessTravel_Travel_Frequently1    -1.773e+02  9.700e+01  -1.827  0.06801 .  
## BusinessTravel_Travel_Rarely1                NA         NA      NA       NA    
## `Department_Human Resources`1         4.256e+02  4.871e+02   0.874  0.38258    
## `Department_Research & Development`1  5.625e+02  3.307e+02   1.701  0.08936 .  
## Department_Sales1                            NA         NA      NA       NA    
## `EducationField_Human Resources`1    -7.526e+01  3.841e+02  -0.196  0.84472    
## `EducationField_Life Sciences`1       3.728e+01  1.366e+02   0.273  0.78490    
## EducationField_Marketing1             9.674e+00  1.794e+02   0.054  0.95700    
## EducationField_Medical1              -7.245e+01  1.416e+02  -0.512  0.60892    
## EducationField_Other1                -1.773e+01  1.951e+02  -0.091  0.92760    
## `EducationField_Technical Degree`1           NA         NA      NA       NA    
## Gender_Female1                       -1.125e+02  7.442e+01  -1.512  0.13097    
## Gender_Male1                                 NA         NA      NA       NA    
## `JobRole_Healthcare Representative`1 -9.286e+01  3.907e+02  -0.238  0.81219    
## `JobRole_Human Resources`1           -2.787e+02  5.158e+02  -0.540  0.58919    
## `JobRole_Laboratory Technician`1     -6.875e+02  3.705e+02  -1.855  0.06388 .  
## JobRole_Manager1                      4.186e+03  3.565e+02  11.744  < 2e-16 ***
## `JobRole_Manufacturing Director`1     7.693e+01  3.884e+02   0.198  0.84304    
## `JobRole_Research Director`1          3.960e+03  4.352e+02   9.098  < 2e-16 ***
## `JobRole_Research Scientist`1        -4.370e+02  3.699e+02  -1.181  0.23775    
## `JobRole_Sales Executive`1            4.251e+02  1.904e+02   2.233  0.02583 *  
## `JobRole_Sales Representative`1              NA         NA      NA       NA    
## MaritalStatus_Divorced1              -2.900e+01  1.343e+02  -0.216  0.82905    
## MaritalStatus_Married1                4.348e+01  1.023e+02   0.425  0.67089    
## MaritalStatus_Single1                        NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1056 on 827 degrees of freedom
## Multiple R-squared:  0.9498, Adjusted R-squared:  0.9472 
## F-statistic: 372.3 on 42 and 827 DF,  p-value: < 2.2e-16

Creating Linear Regression Model

Looking at the linear regression model created above, we want to select all of the columns whose p-values are significant. We will take those columns and use them in our final linear regression model to predict salary. Please note that columns 39-44 contain the Business Travel and Department indicator columns, while columns 53-61 contain the Job Role indicators.

lm_salarydf = employeeData3[,c(16, 20,21, 26, 30, 35,39:44, 53:61)] # 39-44 are the BusinessTravel/Department dummies, 53-61 are the JobRole dummies

lmsalary_model2 = lm(MonthlyIncome~., 
                    data = lm_salarydf)
summary(lmsalary_model2)
## 
## Call:
## lm(formula = MonthlyIncome ~ ., data = lm_salarydf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3777.9  -633.7    -4.8   619.2  4107.0 
## 
## Coefficients: (3 not defined because of singularities)
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           1.448e+02  3.627e+02   0.399 0.689770    
## JobLevel                              2.779e+03  8.153e+01  34.086  < 2e-16 ***
## MonthlyRate                          -9.052e-03  5.056e-03  -1.790 0.073748 .  
## PerformanceRating                    -1.379e+02  1.004e+02  -1.373 0.170137    
## TotalWorkingYears                     4.499e+01  8.342e+00   5.393 8.99e-08 ***
## YearsSinceLastPromotion               1.564e+01  1.275e+01   1.227 0.220322    
## `BusinessTravel_Non-Travel`1         -3.983e+02  1.173e+02  -3.395 0.000718 ***
## BusinessTravel_Travel_Frequently1    -1.972e+02  9.512e+01  -2.073 0.038450 *  
## BusinessTravel_Travel_Rarely1                NA         NA      NA       NA    
## `Department_Human Resources`1         3.978e+02  4.425e+02   0.899 0.368936    
## `Department_Research & Development`1  5.527e+02  3.232e+02   1.710 0.087578 .  
## Department_Sales1                            NA         NA      NA       NA    
## `JobRole_Healthcare Representative`1 -1.113e+02  3.865e+02  -0.288 0.773496    
## `JobRole_Human Resources`1           -3.424e+02  5.078e+02  -0.674 0.500233    
## `JobRole_Laboratory Technician`1     -7.331e+02  3.664e+02  -2.001 0.045706 *  
## JobRole_Manager1                      4.156e+03  3.511e+02  11.837  < 2e-16 ***
## `JobRole_Manufacturing Director`1     1.860e+01  3.840e+02   0.048 0.961381    
## `JobRole_Research Director`1          3.921e+03  4.295e+02   9.129  < 2e-16 ***
## `JobRole_Research Scientist`1        -4.694e+02  3.655e+02  -1.284 0.199417    
## `JobRole_Sales Executive`1            3.681e+02  1.852e+02   1.988 0.047168 *  
## `JobRole_Sales Representative`1              NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1053 on 852 degrees of freedom
## Multiple R-squared:  0.9485, Adjusted R-squared:  0.9475 
## F-statistic: 923.9 on 17 and 852 DF,  p-value: < 2.2e-16
RSS = c(crossprod(lmsalary_model2$residuals))
MSE = RSS / length(lmsalary_model2$residuals)
RMSE = sqrt(MSE)
sig2 = RSS / lmsalary_model2$df.residual

RMSE
## [1] 1042.313
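
Note that the RMSE above is computed from the training residuals. A quick way to gauge out-of-sample error is to refit the same formula on a split of the data and score the held-out rows; the sketch below is an assumed illustration (names such as holdoutIdx and salary_train are hypothetical), not part of the original analysis.

### Sketch: holdout RMSE for the salary model (illustrative, not the original analysis)
holdoutIdx = sample(1:dim(lm_salarydf)[1], round(.70 * dim(lm_salarydf)[1]))
salary_train = lm_salarydf[holdoutIdx,]
salary_test = lm_salarydf[-holdoutIdx,]
holdout_fit = lm(MonthlyIncome~., data = salary_train)
holdout_pred = predict(holdout_fit, salary_test)
sqrt(mean((salary_test$MonthlyIncome - holdout_pred)^2)) # out-of-sample RMSE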

Predicting Salary

# create dummy variables
employee_nosalary= dummy_cols(employee_nosalary, 
                           select_columns = c("BusinessTravel","Department",
                                              "EducationField","Gender",
                                              "JobRole", "MaritalStatus" ))

employee_nosalary$cleanOverTime = ifelse(employee_nosalary$OverTime=="Yes",1,0)
employee_nosalary$JobRole_Others = ifelse(employee_nosalary$JobRole_Manager == 1|
                                            employee_nosalary$`JobRole_Research Director` ==1|
                                            employee_nosalary$`JobRole_Sales Executive` == 1, 0, 1)
employee_nosalary$BusinessTravel_Others = ifelse(employee_nosalary$`BusinessTravel_Non-Travel` == 1, 0, 1)

# change dummy variables to factors
employee_nosalary[,c(36:63)] = employee_nosalary[,c(36:63)] %>%
  mutate_if(sapply(employee_nosalary[,c(36:63)], is.numeric), as.factor)

# predict salary
salary_pred = predict(lmsalary_model2, employee_nosalary)
## Warning in predict.lm(lmsalary_model2, employee_nosalary): prediction from a
## rank-deficient fit may be misleading

Adding Column to No Salary Dataset and Exporting as .csv File

#salary_pred = unlist(salary_pred)

employee_nosalary$SalaryPred =salary_pred 
head(employee_nosalary)
## # A tibble: 6 × 65
##      ID   Age Attrition Busine…¹ Daily…² Depar…³ Dista…⁴ Educa…⁵ Educa…⁶ Emplo…⁷
##   <dbl> <dbl> <chr>     <chr>      <dbl> <chr>     <dbl>   <dbl> <chr>     <dbl>
## 1   871    43 No        Travel_…    1422 Sales         2       4 Life S…       1
## 2   872    33 No        Travel_…     461 Resear…      13       1 Life S…       1
## 3   873    55 Yes       Travel_…     267 Sales        13       4 Market…       1
## 4   874    36 No        Non-Tra…    1351 Resear…       9       4 Life S…       1
## 5   875    27 No        Travel_…    1302 Resear…      19       3 Other         1
## 6   876    39 Yes       Travel_…     895 Sales         5       3 Techni…       1
## # … with 55 more variables: EmployeeNumber <dbl>,
## #   EnvironmentSatisfaction <dbl>, Gender <chr>, HourlyRate <dbl>,
## #   JobInvolvement <dbl>, JobLevel <dbl>, JobRole <chr>, JobSatisfaction <dbl>,
## #   MaritalStatus <chr>, MonthlyRate <dbl>, NumCompaniesWorked <dbl>,
## #   Over18 <chr>, OverTime <chr>, PercentSalaryHike <dbl>,
## #   PerformanceRating <dbl>, RelationshipSatisfaction <dbl>,
## #   StandardHours <dbl>, StockOptionLevel <dbl>, TotalWorkingYears <dbl>, …
## # ℹ Use `colnames()` to see all variable names
filtered_nosalary = employee_nosalary %>% select(ID, SalaryPred) %>% arrange(ID)
head(filtered_nosalary)
## # A tibble: 6 × 2
##      ID SalaryPred
##   <dbl>      <dbl>
## 1   871      5572.
## 2   872      2662.
## 3   873     12258.
## 4   874      1935.
## 5   875      2497.
## 6   876      6114.

Create csv file with salary predictions

write.csv(filtered_nosalary, "Case2PredictionsChang Salary.csv")